and Its Applications 
Abstract —The amount of data in our society has been exploding in the era of big data. In this paper, we address several 
open challenges of big data stream classification, including high volume, high velocity, high dimensionality, high sparsity, and high 
class-imbalance. Many existing studies in the data mining literature solve data stream classification tasks in a batch learning setting, 
which suffers from poor efficiency and scalability when dealing with big data. To overcome these limitations, this paper investigates an 
online learning framework for big data stream classification tasks. Unlike some existing online data stream classification techniques 
that are often based on first-order online learning, we propose a framework of Sparse Online Classification (SOC) for data stream 
classification, which includes some state-of-the-art first-order sparse online learning algorithms as special cases and allows us to 
derive a new effective second-order online learning algorithm for data stream classification. In addition, we propose a new 
cost-sensitive sparse online learning algorithm by extending the framework to tackle online anomaly detection tasks where 
the class distribution of data could be very imbalanced. We also analyze the theoretical bounds of the proposed method, and finally conduct 
an extensive set of experiments, in which encouraging results validate the efficacy of the proposed algorithms in comparison to a family 
of state-of-the-art techniques on a variety of data stream classification tasks. 

Index Terms —online learning; sparse learning; classification; cost-sensitive learning. 



1 Introduction 

In the era of big data, the amount of data in our society
has been exploding, which has raised many opportunities
and challenges for data analytics research in the data
mining community. In this work, we aim to address
challenging real-world big data stream classification
tasks, such as web-scale spam email classification. In
general, big data stream classification has several
characteristics:

• high volume: one has to deal with a huge amount
of existing training data, at the scale of millions or
even billions of examples;

• high velocity: new data often arrives sequentially
and very rapidly, e.g., about 182.9 billion emails are
sent/received worldwide every day according to an
email statistics report by the Radicati Group [1];

• high dimensionality: there are a large number of
features, e.g., for some spam email classification
tasks, the length of the vocabulary list can grow
from 10,000 to 50,000 or even to the million scale;

• high sparsity: many feature elements are zero, and
the fraction of active features is often small, e.g., the
spam email classification study in [2] showed that
accuracy saturates with dozens of features out of
tens of thousands of features; and

• high class-imbalance: one class considerably
dominates the others, e.g., for spam email classification
tasks, the number of non-spam (ham) emails is often
much larger than the number of spam emails.

The above characteristics present huge challenges for
big data stream classification tasks when using
conventional data stream classification techniques, which
are often restricted to the batch learning setting and thus
suffer from several critical drawbacks: (i) they require a
large memory capacity for caching arrived examples; (ii) it
is expensive to collect and train on the entire data set;
(iii) they suffer from expensive re-training cost whenever
new training data arrives; and (iv) their assumption that
all training data must be available a priori does not hold
for real-world data stream applications where data arrives
rapidly in a sequential manner.

To tackle the above challenges, a promising approach
is to explore online learning methodology that performs
incremental training over streaming data in a sequential
manner. Typically, an online learning algorithm processes
one instance at a time and makes very simple updates with
each arriving example. In contrast to batch learning
algorithms, online algorithms are not only more efficient
and scalable, but also able to avoid expensive re-training
cost when handling new training data, making them
preferable choices for solving large-scale machine learning
tasks in big data stream applications. In the literature, a
large variety of algorithms have been proposed, including a
number of first-order algorithms [3], [4] and second-order
algorithms [5], [6], [7]. Despite being studied extensively,
traditional online learning algorithms suffer from a critical
limitation for high-dimensional data. This is because
they assume at least one weight for every feature and
most of the learned weights are often nonzero, making
them inefficient not only in computational time
but also in memory cost for both training and test
phases. Sparse online learning [8] aims to overcome this
limitation by inducing sparsity in the weights learned
by an online learning algorithm.

In this paper, we introduce a framework of Sparse
Online Learning for solving large-scale high-dimensional
data stream classification tasks. We show that the
proposed framework covers some existing first-order sparse
online classification algorithms, and is able to further
derive new algorithms by exploiting second-order
information. The proposed sparse online classification
scheme is far more efficient and scalable than traditional
batch learning algorithms for data stream classification
tasks. We further give a theoretical analysis of
the proposed algorithms and conduct an extensive set
of experiments. The empirical evaluation shows that
the proposed algorithms can achieve state-of-the-art
performance. The rest of this paper is organized as
follows. Section 2 reviews related work. Section 3 presents
our problem formulation. Section 4 proposes our novel
framework. Section 5 discusses our experimental results,
and Section 6 concludes this work.

In summary, our main contributions include:

• We propose a general online learning framework,
from which first-order and second-order algorithms
can be easily derived.

• We provide a general theoretical analysis, including
general regret and mistake bounds for the proposed
algorithms.

• The proposed algorithms are evaluated on several
high-dimensional large-scale benchmark datasets,
where state-of-the-art performance is achieved.

2 Related Work 

Our work is closely related to the studies of online 
learning in machine learning and data mining. Below 
we briefly review some important related works. 

2.1 Online Learning 

Online learning represents a family of efficient and
scalable machine learning algorithms [9], which optimize
some performance measure online, including accuracy [?],
AUC [10], cost-sensitive metrics [11], etc.
Unlike batch learning methods that suffer from expensive
re-training cost, online learning works sequentially
by performing highly efficient (typically constant)
updates for each new training example, making it highly
scalable for data stream classification. In the literature,
various techniques [12], [3], [4], [13], [14], [15], [16] have
been proposed for online learning. Well-known first-order
online learning algorithms include Perceptron [3], [17],
Passive-Aggressive (PA) algorithms [4], etc.

The most well-known method is the Perceptron algorithm
[3], [17], which updates the model by adding a
new example as a support vector with some constant
weight. Recently, a series of sophisticated online learning
algorithms have been proposed by following the
maximum margin learning principle [18], [19], [4].
One famous algorithm is the Passive-Aggressive (PA)
algorithm [4], which evolves a classifier by suffering less
loss on the current instance without moving far from the
previous function.

In recent years, the design of many efficient online
learning algorithms has been influenced by convex
optimization tools. Furthermore, it was observed that
most previously proposed efficient online algorithms
can be jointly analyzed based on the following elegant
model [20]:


Algorithm 1 Online Convex Optimization Scheme

INPUT : A convex set $S$

for $t = 1, \ldots, T$ do
    predict a vector $w_t \in S$;
    receive a convex loss function $\ell_t : S \to \mathbb{R}$;
    suffer loss $\ell_t(w_t)$;
end for


Based on this scheme, online learning can be viewed as
an algorithmic framework for the following convex
optimization problem:

$$\min_{w} f(w) = \min_{w} \sum_{t} \ell_t(w),$$

where $f(w)$ is a convex empirical loss function given by
the sum of losses over a sequence of observations. The
regret of the algorithm is defined as follows:

$$R_T = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} \ell_t(w),$$

where $w$ is any vector in the convex set $S$. The
goal of an online learning algorithm is to find a low-regret
scheme, in which the regret $R_T$ grows sub-linearly with
the number of iterations $T$. As a result, when the number
of rounds $T$ goes to infinity, the difference between the
average loss of the learner and the average loss of the best
learner tends to zero.
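
The predict / receive / suffer loop and the regret definition above can be illustrated with a small sketch. It assumes a plain online subgradient learner on the hinge loss and a synthetic stream; the function names (`online_gradient_descent`, `regret_against`) and the comparator `w_star` are hypothetical and only serve the illustration.

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss [1 - y * w^T x]_+ for one example."""
    return max(0.0, 1.0 - y * float(w @ x))

def online_gradient_descent(X, Y, eta=0.1):
    """Predict, suffer the loss, then take a subgradient step (assumed learner)."""
    w = np.zeros(X.shape[1])
    losses = []
    for x, y in zip(X, Y):
        loss = hinge_loss(w, x, y)       # loss of the current w_t
        losses.append(loss)
        if loss > 0.0:
            w = w + eta * y * x          # step along minus the subgradient -y * x
    return np.array(losses), w

def regret_against(losses, X, Y, w_ref):
    """Cumulative learner loss minus the loss of a fixed comparator w_ref."""
    ref = sum(hinge_loss(w_ref, x, y) for x, y in zip(X, Y))
    return float(np.sum(losses) - ref)

# Toy stream: labels come from a fixed linear rule (synthetic, for illustration).
rng = np.random.default_rng(0)
w_star = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(2000, 3))
Y = np.sign(X @ w_star)
losses, w_final = online_gradient_descent(X, Y)
print("average regret:", regret_against(losses, X, Y, w_star) / len(Y))
```

Here the regret is measured against one fixed comparator rather than the exact minimizer, which gives a lower bound on the regret defined above.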

Although general online learning algorithms (e.g.,
Perceptron and PA) have solid theoretical guarantees
and perform well in many applications, they are generally
limited in several aspects. First, general online
learning algorithms exploit the full feature set, which is
not suitable for large-scale high-dimensional problems. To
tackle this limitation, sparse online learning has been
extensively studied recently. Second, general online
learning algorithms only exploit first-order information
and apply the same learning rate to all features.
This problem can be addressed by second-order online
learning algorithms. Last but not least, general online
learning algorithms are not suitable for imbalanced
input data streams, which can be handled efficiently by
cost-sensitive online learning algorithms. In the following
parts, we briefly introduce several representative
algorithms for each of these three aspects.

2.2 Sparse Online Learning 

Sparse online learning [21], [8] aims to learn a sparse
linear classifier, which contains only a limited number of
active features. It has been actively studied [21], [22],
[23], [24]. There are two groups of solutions for sparse
online learning. The first group follows the general idea
of subgradient descent with truncation. For example,
Duchi and Singer propose the FOBOS algorithm [21], which
extends the Forward-Backward Splitting method to solve the
sparse online learning problem in two phases: (i) an
unconstrained subgradient descent step with respect to the
loss function, and (ii) an instantaneous optimization for
a trade-off between minimizing the regularization term and
keeping close to the result obtained in the first phase.
The optimization problem in the second phase can be
efficiently solved by adopting simple soft-thresholding
operations that perform some truncation on the weight
vectors. Following a similar scheme, Langford et al. [8]
argue that truncation on every iteration is too aggressive,
as each step modifies the coefficients by only a
small amount, and propose the Truncated Gradient (TG)
method, which truncates coefficients every $K$ steps when
they are less than a predefined threshold $\theta$. The second
group mainly follows the dual averaging method of [25],
which can explicitly exploit the regularization structure
in an online setting. One representative work is Regularized
Dual Averaging (RDA) [22], which learns the variables
by solving a simple optimization problem that involves
the running average of all past subgradients of the loss
functions, not just the subgradient in each iteration. Lee
et al. [26] further extend the RDA algorithm by using
a more aggressive truncation threshold and generate
significantly sparser solutions.
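
The two truncation styles described above can be sketched as follows. This is a minimal illustration, not the reference implementation of either method: `fobos_step` assumes an l1 regularizer with weight `lam`, and `truncate` follows the usual description of the truncated gradient operator, where `gravity` (the shrinkage amount) and `theta` (the threshold) are illustrative parameter names.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink every coordinate of v toward zero by lam, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def fobos_step(w, grad, eta, lam):
    """FOBOS-style update: (i) unconstrained subgradient step, then
    (ii) the l1 proximal step, which reduces to soft-thresholding."""
    w_half = w - eta * grad
    return soft_threshold(w_half, eta * lam)

def truncate(w, gravity, theta):
    """Truncated-gradient style shrinkage: only coordinates with
    |w_i| <= theta are pulled toward zero, by at most `gravity`."""
    out = w.copy()
    small = np.abs(w) <= theta
    out[small] = soft_threshold(w[small], gravity)
    return out
```

In a TG-style learner, `truncate` would be applied only every K rounds, whereas the FOBOS-style step thresholds on every round.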

2.3 Second-order Online Learning 

Second-order online learning aims to dynamically
incorporate knowledge of the data observed in earlier
iterations to perform more informative gradient-based
learning. Unlike first-order algorithms that often adopt
the same learning rate for all coordinates, second-order
online learning algorithms adapt the step size employed
for each coordinate. A variety of second-order online
learning algorithms have been proposed recently. Some
techniques attempt to incorporate knowledge of the
geometry of the data observed in earlier iterations to
perform more effective online updates. For example,
Balakrishnan et al. [27] propose algorithms for sparse
linear classifiers in the massive data setting, which
require $O(d^2)$ time and $O(d^2)$ space in
the worst case. Another state-of-the-art technique for
second-order online learning is the family of
confidence-weighted (CW) learning algorithms [13], [?], [29], [?],
[?], which exploit the confidence of weights when making
updates in online learning processes. In general,
second-order algorithms are more accurate and converge
faster, but fall short in two aspects: (i) they incur higher
computational cost, especially when dealing with
high-dimensional data; and (ii) the weight vectors learned
are often not sparse, making them unsuitable for
high-dimensional data. Recently, Duchi et al. addressed
sparsity and second-order updates in the same framework,
and proposed the Adaptive Subgradient method [31]
(Ada-RDA), which adaptively modifies the proximal
function at each iteration to incorporate knowledge
about the geometry of the data.

2.4 Cost-Sensitive Online Learning 

Cost-sensitive classification has been extensively studied
in data mining and machine learning. In the past decade,
a variety of cost-sensitive metrics have been proposed to
tackle this problem, for example, the weighted sum of
sensitivity and specificity [32], and the weighted
misclassification cost [33], [34]. Although both
cost-sensitive classification and online learning have been
studied extensively in the data mining and machine learning
communities, there are only a few works on cost-sensitive
online learning. For example, Wang et al. [11] proposed a
family of cost-sensitive online classification algorithms,
which are designed to directly optimize two well-known
cost-sensitive measures. Zhao and Hoi [35] tackle the same
problem by adopting the double updating technique
and propose Cost-Sensitive Double Updating Online
Learning (CSDUOL).

3 Sparse Online Learning for Data 
Stream Classification 

In this section, we first introduce a general sparse online 
learning framework for online data stream classification, 
and then provide the theoretical analysis on the frame¬ 
work. The framework will be used to derive the family of 
first-order and second-order sparse online classification 
algorithms in the following section. 

3.1 General Sparse Online Learning 

Without loss of generality, we consider the sparse online
learning algorithm for the binary classification problem,
which is also referred to as the sparse online classification
problem in this paper. The sparse online classification
algorithm generally works in rounds. Specifically, at
round $t$, the algorithm is presented with one instance
$x_t \in \mathbb{R}^d$, and then predicts its label as

$$\hat{y}_t = \mathrm{sign}(w_t^\top x_t),$$

where $w_t \in \mathbb{R}^d$ is the linear classifier maintained by the
algorithm. After the prediction, the algorithm receives the
true label $y_t \in \{+1, -1\}$ and suffers a loss $\ell_t(w_t)$. Then,
the algorithm updates its prediction function $w_t$
based on the newly received $(x_t, y_t)$. The standard goal
of online learning is to minimize the number of mistakes
suffered by the online algorithm. To facilitate the
analysis, we first introduce several functions. First,
the hinge loss $\ell_t(w; (x_t, y_t)) = [1 - y_t w^\top x_t]_+$, where
$[a]_+ = \max(a, 0)$, is the most popular loss function for
binary classification problems. Second, we are given a series
of $\beta_t$-strongly convex functions $\Phi_t$, $t = 1, \ldots, T$, with
respect to the norms $\|\cdot\|_{\Phi_t}$ and the dual norms
$\|\cdot\|_{\Phi_t^*}$. The proposed general
sparse online classification (SOC) algorithm is shown in
Algorithm 2.


Algorithm 2 General Sparse Online Learning (SOL)

INPUT : $\lambda$, $\eta$

INITIALIZATION : $\theta_1 = 0$.
for $t = 1, \ldots, T$ do
    receive $x_t \in \mathbb{R}^d$;
    $u_t = \nabla \Phi_t^*(\theta_t)$;
    $w_t = \arg\min_{w} \frac{1}{2}\|u_t - w\|_2^2 + \lambda_t \|w\|_1$;
    predict $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$;
    receive $y_t$ and suffer $\ell_t(w_t) = [1 - y_t w_t^\top x_t]_+$;
    if $\ell_t(w_t) > 0$ then
        $\theta_{t+1} = \theta_t - \eta_t z_t$, where $z_t = \nabla \ell_t(w_t)$;
    end if
end for
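
A minimal sketch of this general scheme is given below, under two assumptions that are not spelled out in the listing: the argmin step is replaced by its closed form (coordinate-wise soft-thresholding), and a constant step size `eta` is used. The hooks `grad_dual` (standing in for the gradient of the dual function) and `lam` (returning the truncation level for round t) are hypothetical names supplied by the caller.

```python
import numpy as np

def soft_threshold(u, lam):
    """Closed-form solution of argmin_w 0.5*||u - w||_2^2 + lam*||w||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def sparse_online_learning(stream, dim, grad_dual, eta, lam):
    """One pass of the general scheme over (x, y) pairs.

    grad_dual(theta, t) maps the dual vector theta to u_t and lam(t)
    returns lambda_t; both are caller-supplied (hypothetical) hooks.
    Returns the last primal weights, the dual vector, and the mistake count.
    """
    theta = np.zeros(dim)
    w = np.zeros(dim)
    mistakes = 0
    for t, (x, y) in enumerate(stream, start=1):
        u = grad_dual(theta, t)
        w = soft_threshold(u, lam(t))
        margin = y * float(w @ x)
        if margin <= 0.0:
            mistakes += 1
        if 1.0 - margin > 0.0:           # the hinge loss is positive
            z = -y * x                   # subgradient of the hinge loss at w
            theta = theta - eta * z
    return w, theta, mistakes

# Usage: the identity dual map corresponds to a squared Euclidean potential.
stream = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1), (np.array([1.0, 0.0]), 1)]
w, theta, mistakes = sparse_online_learning(stream, 2, lambda th, t: th, eta=1.0, lam=lambda t: 0.5)
```

Swapping `grad_dual` for a map that multiplies by an inverse correlation matrix would give a second-order variant without touching the rest of the loop.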


3.2 Theoretical Bound Analysis 

In this section, we analyze the regret $R_T$ of the general
sparse online learning (SOL) algorithm. First, we
present a key lemma, which will facilitate the following
analysis.

Lemma 1: Let $\Phi_t$, $t = 1, \ldots, T$, be $\beta_t$-strongly convex
functions with respect to the norms $\|\cdot\|_{\Phi_t}$, and let
$\|\cdot\|_{\Phi_t^*}$ be the respective dual norms. Let $\Phi_0(0) = 0$,
and let $x_1, \ldots, x_T$ be an arbitrary sequence of vectors in
$\mathbb{R}^d$. Assume that Algorithm 2 is run on this sequence with
the functions $\Phi_t$. Then, we have the following inequality

$$\sum_{t=1}^{T} \eta_t (w_t - w)^\top z_t \le \Phi_T(w) + \sum_{t=1}^{T} \left[ \Phi_t^*(\theta_t) - \Phi_{t-1}^*(\theta_t) + \frac{\eta_t^2}{2\beta_t} \|z_t\|_{\Phi_t^*}^2 + \eta_t \lambda_t \|z_t\|_1 \right], \qquad (1)$$

for any $w$, and any $\lambda > 0$.

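Proof: First, define $\Delta_t = \Phi_t^*(\theta_{t+1}) - \Phi_{t-1}^*(\theta_t)$; then

$$\sum_{t=1}^{T} \Delta_t = \Phi_T^*(\theta_{T+1}) - \Phi_0^*(\theta_1) \ge \Phi_T^*(\theta_{T+1}) \ge w^\top \theta_{T+1} - \Phi_T(w),$$

where the final inequality is due to Fenchel's inequality.
In addition, we have

$$\begin{aligned}
\Delta_t &= \Phi_t^*(\theta_{t+1}) - \Phi_t^*(\theta_t) + \Phi_t^*(\theta_t) - \Phi_{t-1}^*(\theta_t) \\
&\le \Phi_t^*(\theta_t) - \Phi_{t-1}^*(\theta_t) - \eta_t \left(\nabla \Phi_t^*(\theta_t)\right)^\top z_t + \frac{\eta_t^2}{2\beta_t} \|z_t\|_{\Phi_t^*}^2 .
\end{aligned}$$

Combining the above two inequalities, and using
$\theta_{T+1} = -\sum_{t=1}^{T} \eta_t z_t$ and $u_t = \nabla \Phi_t^*(\theta_t)$, we get

$$-\sum_{t=1}^{T} \eta_t w^\top z_t - \Phi_T(w) \le \sum_{t=1}^{T} \Delta_t \le \sum_{t=1}^{T} \left[ \Phi_t^*(\theta_t) - \Phi_{t-1}^*(\theta_t) - \eta_t u_t^\top z_t + \frac{\eta_t^2}{2\beta_t} \|z_t\|_{\Phi_t^*}^2 \right].$$

Rearranging the above inequality, we get

$$\sum_{t=1}^{T} \eta_t (u_t - w)^\top z_t \le \Phi_T(w) + \sum_{t=1}^{T} \left[ \Phi_t^*(\theta_t) - \Phi_{t-1}^*(\theta_t) + \frac{\eta_t^2}{2\beta_t} \|z_t\|_{\Phi_t^*}^2 \right]. \qquad (2)$$

Now, we connect $w_t^\top z_t$ and $u_t^\top z_t$ as follows:

$$\begin{aligned}
w_t^\top z_t &= \sum_{i=1}^{d} w_{t,i} z_{t,i} = \sum_{i=1}^{d} \mathrm{sign}(u_{t,i}) \left[ |u_{t,i}| - \lambda_t \right]_+ z_{t,i} \\
&= \sum_{u_{t,i} z_{t,i} \ge 0} \left[ |u_{t,i}| - \lambda_t \right]_+ |z_{t,i}| - \sum_{u_{t,i} z_{t,i} < 0} \left[ |u_{t,i}| - \lambda_t \right]_+ |z_{t,i}| \\
&\le \sum_{u_{t,i} z_{t,i} \ge 0} |u_{t,i}| |z_{t,i}| - \sum_{u_{t,i} z_{t,i} < 0} \left( |u_{t,i}| |z_{t,i}| - \lambda_t |z_{t,i}| \right) \\
&= \sum_{u_{t,i} z_{t,i} \ge 0} u_{t,i} z_{t,i} + \sum_{u_{t,i} z_{t,i} < 0} \left( u_{t,i} z_{t,i} + \lambda_t |z_{t,i}| \right) \\
&\le u_t^\top z_t + \lambda_t \|z_t\|_1 .
\end{aligned}$$

Plugging this inequality into inequality (2) concludes
the lemma. □

Given this general lemma, we provide a general
corollary, which directly upper bounds the regret
suffered by this framework. To derive this corollary,
we only need to lower bound the left-hand side of
inequality (1) by using $\ell_t(w_t) - \ell_t(w) \le (w_t - w)^\top z_t$,
which is a property of convex functions.

Corollary 1: Under the assumptions of Lemma 1, if we
further assume $\ell_t$ is convex and $\eta_t = \eta$, then the regret
$R_T = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} \ell_t(w)$ of the proposed
framework (Algorithm 2) satisfies the following inequality

$$R_T \le \frac{\Phi_T(w)}{\eta} + \sum_{t=1}^{T} \left[ \frac{\eta}{2\beta_t} \|z_t\|_{\Phi_t^*}^2 + \lambda_t \|z_t\|_1 \right] + \frac{\sum_{t=1}^{T} \Delta_t^*}{\eta}, \qquad (3)$$

where $\Delta_t^* = \Phi_t^*(\theta_t) - \Phi_{t-1}^*(\theta_t)$.

Given this framework and analysis, we derive some
specific algorithms and their regret bounds.
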


4 Derived Algorithms 

In this section, we first recover the RDA [22] algorithm
and then derive an algorithm that utilizes second-order
information. Throughout this section, we adopt the
hinge loss function and denote $\mathcal{L} = \{t \mid \ell_t(w_t) > 0\}$. We
denote $L_t = \mathbb{I}_{(\ell_t(w_t) > 0)}$, where $\mathbb{I}_v$ is the indicator function,
with $\mathbb{I}_v = 1$ if $v$ is true and $\mathbb{I}_v = 0$ otherwise.

4.1 First Order Algorithm

Set $\Phi_t(w) = \frac{1}{2}\|w\|_2^2$, which is 1-strongly convex with
respect to $\|\cdot\|_2$. It is known that the dual norm
of $\|\cdot\|_2$ is $\|\cdot\|_2$ itself, while $\Phi_t^* = \Phi_t$. Under these
assumptions, we get the first order sparse online learning
(FSOL) algorithm, which is the same as the Regularized
Dual Averaging (RDA) algorithm with soft 1-norm
regularization [22].


Algorithm 3 First Order Sparse Online Learning (FSOL)

INPUT : $\lambda$, $\eta$

INITIALIZATION : $\theta_1 = 0$.
for $t = 1, \ldots, T$ do
    receive $x_t \in \mathbb{R}^d$;
    $w_t = \mathrm{sign}(\theta_t) \odot [|\theta_t| - \lambda_t]_+$;
    predict $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$ and receive $y_t \in \{-1, 1\}$;
    suffer $\ell_t(w_t) = [1 - y_t w_t^\top x_t]_+$;
    $\theta_{t+1} = \theta_t + \eta L_t y_t x_t$;
end for
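
The first-order algorithm is short enough to write out directly. The sketch below assumes a constant truncation level equal to `eta * lam` on every round; the function name `fsol` and the default parameter values are illustrative only.

```python
import numpy as np

def fsol(stream, dim, eta=0.1, lam=0.01):
    """First-order sparse online learning sketch with lambda_t = eta * lam."""
    theta = np.zeros(dim)
    mistakes = 0
    for x, y in stream:
        # Primal weights: coordinate-wise soft-thresholding of the dual vector.
        w = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)
        margin = y * float(w @ x)
        if margin <= 0.0:
            mistakes += 1
        if margin < 1.0:                 # L_t = 1: the hinge loss is positive
            theta = theta + eta * y * x
    w = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)
    return w, mistakes

# Usage on a tiny stream where only the first feature is ever active.
example = [(np.array([1.0, 0.0, 0.0]), 1)] * 3
w_out, n_mistakes = fsol(example, 3, eta=1.0, lam=0.5)
```

Because the weights are recomputed from the dual vector on every round, coordinates whose accumulated dual value stays below the truncation level remain exactly zero.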


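Theorem 2: Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of
examples, where $x_t \in \mathbb{R}^d$, $y_t \in \{-1, +1\}$ and $\|x_t\|_1 \le X$
for all $t$. If we further set $\lambda_t = \eta\lambda$, then the regret
$R_T = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} \ell_t(w)$ suffered by
Algorithm 3 is bounded as follows:

$$R_T \le \frac{\|w\|_2^2}{2\eta} + \sum_{t=1}^{T} \frac{\eta}{2} X^2 + \sum_{t=1}^{T} \eta \lambda X,$$

for any $w \in \mathbb{R}^d$. Further setting $\eta = \frac{\|w\|_2}{\sqrt{(X^2 + 2\lambda X) T}}$, we have

$$R_T \le D \sqrt{(X^2 + 2\lambda X) T},$$

for any $w \in \{w \mid \|w\|_2 \le D\}$.

Proof: First, $\Delta_t^* = \Phi_t^*(\theta_t) - \Phi_{t-1}^*(\theta_t) = 0$; then
according to Corollary 1, we have

$$\begin{aligned}
R_T &\le \frac{\|w\|_2^2}{2\eta} + \sum_{t=1}^{T} \left[ \frac{\eta}{2} \|L_t y_t x_t\|_2^2 + \lambda_t \|L_t y_t x_t\|_1 \right] \\
&\le \frac{\|w\|_2^2}{2\eta} + \sum_{t=1}^{T} \frac{\eta}{2} X^2 + \sum_{t=1}^{T} \eta \lambda X .
\end{aligned}$$

□

Remark: This bound indicates that the regret of this
algorithm is upper bounded by $O(\sqrt{T})$, which recovers the
results in [22].


4.2 Second Order Algorithm

Set $\Phi_t(w) = \frac{1}{2} w^\top A_t w$, where $A_t = A_{t-1} + \frac{x_t x_t^\top}{r}$, $r > 0$
and $A_0 = I$. It is easy to verify that $\Phi_t$ is 1-strongly
convex with respect to $\|w\|_{\Phi_t}^2 = w^\top A_t w$. Its dual function
$\Phi_t^*(w)$ is $\frac{1}{2} w^\top A_t^{-1} w$, while $\|w\|_{\Phi_t^*}^2 = w^\top A_t^{-1} w$. Using
the Woodbury identity, we can incrementally update the
inverse of $A_t$ as

$$A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{r + x_t^\top A_{t-1}^{-1} x_t}.$$

Under these assumptions, we get the second order sparse online
learning (SSOL) algorithm.


Algorithm 4 Second Order Sparse Online Learning (SSOL)

INPUT : $\lambda$, $\eta$

INITIALIZATION : $\theta_1 = 0$, $A_0^{-1} = I$.
for $t = 1, \ldots, T$ do
    receive $x_t \in \mathbb{R}^d$;
    $A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{r + x_t^\top A_{t-1}^{-1} x_t}$;
    $u_t = A_t^{-1} \theta_t$;
    $w_t = \mathrm{sign}(u_t) \odot [|u_t| - \lambda_t]_+$;
    predict $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$ and receive $y_t \in \{-1, 1\}$;
    suffer $\ell_t(w_t) = [1 - y_t w_t^\top x_t]_+$;
    $\theta_{t+1} = \theta_t + \eta L_t y_t x_t$;
end for


Theorem 3: Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of
examples, where $x_t \in \mathbb{R}^d$, $y_t \in \{-1, +1\}$ and $\|x_t\|_1 \le X$
for all $t$. If we further set $\lambda_t = \lambda / t$, then the regret
$R_T = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} \ell_t(w)$ suffered by
Algorithm 4 is bounded as

$$R_T \le \frac{D^2}{2\eta} + \frac{\eta}{2} r d \log\left(1 + \frac{X^2 T}{r}\right) + \lambda X [\log(T) + 1],$$

for any $w \in \{w \mid w^\top A_T w \le D^2\}$.

Proof: First, it is easy to observe that

$$\Delta_t^* = \frac{1}{2} \theta_t^\top A_t^{-1} \theta_t - \frac{1}{2} \theta_t^\top A_{t-1}^{-1} \theta_t = -\frac{(\theta_t^\top A_{t-1}^{-1} x_t)^2}{2 (r + x_t^\top A_{t-1}^{-1} x_t)} \le 0 .$$

Then according to the conclusion in Corollary 1, we
have

$$\begin{aligned}
R_T &\le \frac{w^\top A_T w}{2\eta} + \sum_{t=1}^{T} \left[ \frac{\eta}{2} x_t^\top A_t^{-1} x_t + \lambda_t \|L_t y_t x_t\|_1 \right] \\
&\le \frac{w^\top A_T w}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} x_t^\top A_t^{-1} x_t + \lambda X \sum_{t=1}^{T} \frac{1}{t} \\
&\le \frac{w^\top A_T w}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} x_t^\top A_t^{-1} x_t + \lambda X [\log(T) + 1],
\end{aligned}$$

where the final inequality uses $\sum_{t=1}^{T} \frac{1}{t} \le \log(T) + 1$.
Secondly, the second term on the right-hand side can be
upper bounded as

$$\sum_{t=1}^{T} x_t^\top A_t^{-1} x_t = r \sum_{t=1}^{T} \left( 1 - \frac{\det(A_{t-1})}{\det(A_t)} \right) \le -r \sum_{t=1}^{T} \log\left( \frac{\det(A_{t-1})}{\det(A_t)} \right) = r \log(\det(A_T)).$$

Combining the above two inequalities gives

$$R_T \le \frac{w^\top A_T w}{2\eta} + \frac{\eta}{2} r \log(\det(A_T)) + \lambda X [\log(T) + 1]. \qquad (4)$$

Since $A_T = I + \sum_{t=1}^{T} \frac{x_t x_t^\top}{r}$, each of its eigenvalues $\mu_i$ satisfies

$$\mu_i \le 1 + \mathrm{trace}\left( \sum_{t=1}^{T} \frac{x_t x_t^\top}{r} \right) = 1 + \sum_{t=1}^{T} \frac{\|x_t\|_2^2}{r} \le 1 + \frac{X^2 T}{r}.$$

As a result, we have

$$\det(A_T) = \prod_{i=1}^{d} \mu_i \le \left( 1 + \frac{X^2 T}{r} \right)^d .$$

Plugging the above inequality into (4) concludes this
theorem. □

Remark: According to this theorem, adopting second-order
information for sparse online learning further reduces
the regret bound to the order of $O(\log(T))$.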
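
The rank-one inverse update used by the second-order algorithm can be sketched as below. This is an illustrative full-matrix version that costs quadratic time and memory per round; it assumes the truncation level `lam / t` from the theorem, and the names `sherman_morrison_update` and `ssol` are hypothetical.

```python
import numpy as np

def sherman_morrison_update(A_inv, x, r):
    """Inverse of (A + x x^T / r) given the symmetric A_inv = A^{-1}."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (r + float(x @ Ax))

def ssol(stream, dim, eta=0.1, lam=0.01, r=1.0):
    """Full-matrix second-order sketch with lambda_t = lam / t."""
    theta = np.zeros(dim)
    A_inv = np.eye(dim)                  # A_0 = I
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        A_inv = sherman_morrison_update(A_inv, x, r)
        u = A_inv @ theta
        w = np.sign(u) * np.maximum(np.abs(u) - lam / t, 0.0)
        if y * float(w @ x) < 1.0:       # L_t = 1: the hinge loss is positive
            theta = theta + eta * y * x
    return w, A_inv
```

The incremental inverse can be checked against a direct inversion of the accumulated matrix on a few random vectors, which is a cheap sanity test before running on a long stream.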


4.3 Diagonal Algorithm

Although the previous second order algorithm significantly
reduces the regret bound compared with the first order
algorithm, it consumes $O(d^2)$ time, which limits
its application to real-world high-dimensional problems.
To keep the computational time at $O(d)$, similar to
traditional online learning, we further explore its
diagonal version, which only maintains a diagonal
matrix. Its details are given in Algorithm 5.


Algorithm 5 Diagonal Second Order Sparse Online Learning

INPUT : $\lambda$, $\eta$

INITIALIZATION : $\theta_1 = 0$, $A_0^{-1} = I$.
for $t = 1, \ldots, T$ do
    receive $x_t \in \mathbb{R}^d$;
    $A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} \mathrm{diag}(x_t x_t^\top) A_{t-1}^{-1}}{r + x_t^\top A_{t-1}^{-1} x_t}$;
    $u_t = A_t^{-1} \theta_t$;
    $w_t = \mathrm{sign}(u_t) \odot [|u_t| - \lambda_t]_+$;
    predict $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$ and receive $y_t \in \{-1, 1\}$;
    suffer $\ell_t(w_t) = [1 - y_t w_t^\top x_t]_+$;
    $\theta_{t+1} = \theta_t + \eta L_t y_t x_t$;
end for


In the following experiments, we mainly adopt the
diagonal second order sparse online learning algorithm
unless otherwise specified, which is also denoted as
"SSOL".

4.4 Cost-Sensitive Algorithm

For the previous algorithms, the classifier is
cost-insensitive, i.e., it suffers the same cost/loss whether
positive samples or negative samples are misclassified.
This is inappropriate for many data stream classification
tasks in real-world applications, such as online
anomaly detection, where the class distribution is often
highly imbalanced. In this section, we propose a
cost-sensitive sparse online classification algorithm by
extending the sparse online learning framework for online
anomaly detection tasks. Without loss of generality, we
assume the positive class is the rare class in a set of
streaming data, which contains fewer positive examples
than negative examples. We prefer a high cost/loss
value when a positive sample is misclassified, and a
small cost/loss value when a negative sample is
misclassified.

Specifically, we denote the numbers of positive and
negative samples by $T_+$ and $T_-$, respectively;
$M_+$ and $M_-$ are the numbers of false negatives and false
positives, respectively. We denote $T = T_+ + T_-$ and
$M = M_+ + M_-$. Instead of using the cost-insensitive metric
accuracy $= \frac{T - M}{T}$, researchers have proposed a variety
of cost-sensitive metrics. One well-known cost-sensitive
metric is the weighted sum of sensitivity $= \frac{T_+ - M_+}{T_+}$ and
specificity $= \frac{T_- - M_-}{T_-}$, which is defined as follows:

$$sum = \mu_+ \frac{T_+ - M_+}{T_+} + \mu_- \frac{T_- - M_-}{T_-},$$

where $\mu_+ + \mu_- = 1$ and $0 \le \mu_+, \mu_- \le 1$ are two
parameters to trade off between sensitivity and
specificity. In general, the higher the sum value, the better
the classification performance. Notably, when
$\mu_+ = \mu_- = 0.5$, the corresponding sum is the well-known
balanced accuracy [32].
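
To make the difference between plain accuracy and the weighted sum concrete, the following sketch evaluates both on an imbalanced toy count. The function and argument names are illustrative; `mu_pos` is the weight on sensitivity and the weight on specificity is taken as its complement.

```python
def weighted_sum(t_pos, t_neg, m_pos, m_neg, mu_pos=0.5):
    """Weighted sum of sensitivity and specificity.

    t_pos / t_neg: numbers of positive / negative samples;
    m_pos / m_neg: numbers of false negatives / false positives;
    mu_pos weights sensitivity and (1 - mu_pos) weights specificity.
    """
    sensitivity = (t_pos - m_pos) / t_pos
    specificity = (t_neg - m_neg) / t_neg
    return mu_pos * sensitivity + (1.0 - mu_pos) * specificity

def accuracy(t_pos, t_neg, m_pos, m_neg):
    """Plain (cost-insensitive) accuracy (T - M) / T."""
    total = t_pos + t_neg
    return (total - m_pos - m_neg) / total

# A classifier that labels everything negative on 10 positives and 990 negatives.
acc = accuracy(10, 990, 10, 0)
bal = weighted_sum(10, 990, 10, 0)
```

With these counts the plain accuracy is 0.99 while the balanced accuracy is only 0.5, which is why the weighted sum is the more informative metric for rare-class detection.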

To maximize the sum value, based on the previous
framework, we propose a cost-sensitive sparse online
classification algorithm following the theoretical
analysis in [11], [35]. In particular, we
adopt a modified hinge loss function:

$$\left( \rho \, \mathbb{I}_{(y_t = 1)} + \mathbb{I}_{(y_t = -1)} \right) [1 - y_t w^\top x_t]_+,$$

where $\rho = \frac{\mu_+ T_-}{\mu_- T_+}$ and $\mathbb{I}_v$ is an indicator function, with
$\mathbb{I}_v = 1$ if $v$ is true and $\mathbb{I}_v = 0$ otherwise. In our
experiments, we use the balanced accuracy as the metric and
set $\mu_+ = \mu_- = 0.5$. Generally, it is difficult to know
the numbers of positive and negative samples $T_+$ and
$T_-$ in advance. So a more realistic setting is to use
two weight parameters $c_+$ and $c_-$ for the positive and
negative losses, respectively. Hence, the loss function is
reformulated as:

$$\left( c_+ \mathbb{I}_{(y_t = 1)} + c_- \mathbb{I}_{(y_t = -1)} \right) [1 - y_t w^\top x_t]_+ .$$

Denoting $c_{y_t} = c_+ \mathbb{I}_{(y_t = 1)} + c_- \mathbb{I}_{(y_t = -1)}$, the modified
regret is defined as follows:

$$R_T = \sum_{t} c_{y_t} \ell_t(w_t) - \sum_{t} c_{y_t} \ell_t(w),$$

where $\ell_t(w_t) = [1 - y_t w_t^\top x_t]_+$.

Based on the proposed sparse online learning framework
and the cost-sensitive loss function, we obtain
the cost-sensitive first order sparse online learning
algorithm (CS-FSOL) shown in Algorithm 6.


Algorithm 6 Cost-Sensitive First Order Sparse Online Learning (CS-FSOL)

INPUT : $\lambda$, $\eta$, $c_+$, $c_-$

INITIALIZATION : $\theta_1 = 0$.
for $t = 1, \ldots, T$ do
    receive $x_t \in \mathbb{R}^d$;
    $w_t = \mathrm{sign}(\theta_t) \odot [|\theta_t| - \lambda_t]_+$;
    predict $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$ and receive $y_t \in \{-1, 1\}$;
    suffer $\ell_t(w_t) = [1 - y_t w_t^\top x_t]_+$;
    $\theta_{t+1} = \theta_t + \eta c_{y_t} L_t y_t x_t$;
end for
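
The only change relative to the cost-insensitive first-order update is the class-dependent factor on the step. A minimal sketch, assuming a constant truncation level `eta * lam` and the hypothetical parameter names `c_pos` and `c_neg` for the two class weights:

```python
import numpy as np

def cs_fsol(stream, dim, eta=0.1, lam=0.01, c_pos=1.0, c_neg=1.0):
    """Cost-sensitive first-order sketch: the step is scaled by the class weight."""
    theta = np.zeros(dim)
    for x, y in stream:
        w = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)
        if y * float(w @ x) < 1.0:       # L_t = 1: the hinge loss is positive
            c = c_pos if y > 0 else c_neg
            theta = theta + eta * c * y * x
    return np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)

# Usage: a positive example moves the weights three times as far as a negative one.
pairs = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]
w_cs = cs_fsol(pairs, 2, eta=1.0, lam=0.0, c_pos=3.0, c_neg=1.0)
```

Choosing a larger weight for the rare class makes each of its margin violations count more, which pushes the decision boundary away from the rare class.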

For this algorithm, it is easy to observe that if we treat
$\eta c_{y_t}$ as $\eta_t$, then it is a special case of the proposed
framework (Algorithm 2) with $\Phi_t(w) = \frac{1}{2}\|w\|_2^2$. So, we would
like to prove a new corollary for the proposed framework
under the setting $\eta_t = \eta c_{y_t}$. This can be achieved
by combining Lemma 1 with $\eta c_{y_t} [\ell_t(w_t) - \ell_t(w)] \le \eta_t (w_t - w)^\top z_t$.
Specifically, we have the following corollary:

Corollary 4: Under the assumptions of Lemma 1, if
we further assume $\ell_t$ is convex and $\eta_t = \eta c_{y_t}$, then
the regret $R_T = \sum_{t=1}^{T} c_{y_t} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} c_{y_t} \ell_t(w)$
of the proposed framework (Algorithm 2) satisfies the
following inequality

$$R_T \le \frac{\Phi_T(w)}{\eta} + \sum_{t=1}^{T} \left[ \frac{\eta}{2\beta_t} \|c_{y_t} z_t\|_{\Phi_t^*}^2 + \lambda_t \|c_{y_t} z_t\|_1 \right] + \frac{\sum_{t=1}^{T} \Delta_t^*}{\eta}, \qquad (5)$$

where $\Delta_t^* = \Phi_t^*(\theta_t) - \Phi_{t-1}^*(\theta_t)$.

Given the above corollary, we can prove the following
theorem for Algorithm 6.

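Theorem 5: Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of
examples, where $x_t \in \mathbb{R}^d$, $y_t \in \{-1, +1\}$ and $\|x_t\|_1 \le X$
for all $t$. If we further set $\lambda_t = \eta\lambda$, then the regret
$R_T = \sum_{t=1}^{T} c_{y_t} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} c_{y_t} \ell_t(w)$ suffered by
Algorithm 6 is bounded as follows:

$$R_T \le \frac{\|w\|_2^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} c_{y_t} X^2 + \eta \lambda \sum_{t=1}^{T} c_{y_t} X,$$

for any $w \in \mathbb{R}^d$.
Further setting $\eta = \frac{\|w\|_2}{\sqrt{(X^2 + 2\lambda X)(T_+ c_+ + T_- c_-)}}$, we have

$$R_T \le D \sqrt{(X^2 + 2\lambda X)(T_+ c_+ + T_- c_-)},$$

for any $w \in \{w \mid \|w\|_2 \le D\}$.

We omit the proof, as it is straightforward.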

In addition, we can also obtain the cost-sensitive second
order sparse online classification (CS-SSOL) algorithm
shown in Algorithm 7. However, since its time and space
complexity are relatively high for high-dimensional
datasets, we only use its diagonal variant in practice,
where only a diagonal $A_t^{-1}$ is maintained and updated.


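Algorithm 7 Cost-Sensitive Second Order Sparse Online Learning (CS-SSOL)

INPUT : $\lambda$, $\eta$, $r$, $c_+$, $c_-$

INITIALIZATION : $\theta_1 = 0$, $A_0^{-1} = I$.
for $t = 1, \ldots, T$ do
    receive $x_t \in \mathbb{R}^d$;
    $A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{r + x_t^\top A_{t-1}^{-1} x_t}$;
    $u_t = A_t^{-1} \theta_t$;
    $w_t = \mathrm{sign}(u_t) \odot [|u_t| - \lambda_t]_+$;
    predict $\hat{y}_t = \mathrm{sign}(w_t^\top x_t)$ and receive $y_t \in \{-1, 1\}$;
    suffer $\ell_t(w_t) = [1 - y_t w_t^\top x_t]_+$;
    $\theta_{t+1} = \theta_t + \eta c_{y_t} L_t y_t x_t$;
end for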
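
The diagonal variant mentioned above can be sketched by storing only the diagonal of the inverse matrix as a vector, so each round costs time linear in the dimension. This is an illustrative reading of the diagonal update with the truncation level `lam / t`; the function name `cs_ssol_diag` and its parameter names are hypothetical.

```python
import numpy as np

def cs_ssol_diag(stream, dim, eta=0.1, lam=0.01, r=1.0, c_pos=1.0, c_neg=1.0):
    """Diagonal cost-sensitive second-order sketch (O(d) work per round)."""
    theta = np.zeros(dim)
    a_inv = np.ones(dim)                 # diagonal of A_0^{-1} = I
    t = 0
    for t, (x, y) in enumerate(stream, start=1):
        # Diagonal counterpart of the rank-one inverse update.
        a_inv = a_inv - (a_inv * x) ** 2 / (r + float(x @ (a_inv * x)))
        u = a_inv * theta
        w = np.sign(u) * np.maximum(np.abs(u) - lam / t, 0.0)
        if y * float(w @ x) < 1.0:       # L_t = 1: the hinge loss is positive
            c = c_pos if y > 0 else c_neg
            theta = theta + eta * c * y * x
    u = a_inv * theta
    w = np.sign(u) * np.maximum(np.abs(u) - lam / (t + 1), 0.0)
    return w, a_inv

# Usage: one positive example with a class weight of two.
w_diag, a_diag = cs_ssol_diag([(np.array([1.0, 0.0]), 1.0)], 2, eta=1.0, lam=0.0, c_pos=2.0)
```

Frequently active coordinates accumulate smaller diagonal entries and therefore smaller effective step sizes, while rarely seen coordinates keep values close to one.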


It is easy to verify that this algorithm is a special
case of the proposed framework (Algorithm 2), when $\eta_t = \eta c_{y_t}$
and $\Phi_t(w) = \frac{1}{2} w^\top A_t w$. So, Corollary 4 holds for
this algorithm. Using this corollary, we can prove the
following theorem for Algorithm 7.

Theorem 6: Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a sequence of
examples, where $x_t \in \mathbb{R}^d$, $y_t \in \{-1, +1\}$ and $\|x_t\|_1 \le X$
for all $t$. If we further set $\lambda_t = \lambda / t$, then the regret
$R_T = \sum_{t=1}^{T} c_{y_t} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} c_{y_t} \ell_t(w)$ suffered by
Algorithm 7 is bounded as

$$R_T \le \frac{D^2}{2\eta} + c_{\max} \frac{\eta}{2} r d \log\left(1 + \frac{X^2}{r} T\right) + c_{\max} \lambda X [\log(T) + 1],$$

for any $w \in \{w \mid w^\top A_T w \le D^2\}$, where $c_{\max} = \max(c_+, c_-)$.

The proof of this theorem is omitted, since it mainly
follows that of Theorem 3.

5 Experiments 

In this section, we conduct an extensive set of
experiments to evaluate the performance of the proposed
sparse online classification algorithms on both synthetic
and real datasets.

5.1 Experimental Setup 

In our experiments, we compare the proposed algo¬ 
rithms with a set of state-of-the-art algorithms, includ¬ 
ing the sparse online learning algorithms and the cost- 
sensitive online learning algorithms. The methodology 
details of these algorithms are listed in Table [l] The three 
existing algorithms (CS-OGD, CPA and PAUM) are cost- 
sensitive online learning without sparsity regularizer. 

TABLE 1
List of Compared Algorithms.

| Algorithm | 1st/2nd Order | Sparsity | Description |
|---|---|---|---|
| STG | First Order | Truncate Gradient | Stochastic Gradient Descent [8] |
| FOBOS | First Order | Truncate Gradient | FOrward Backward Splitting |
| Ada-FOBOS | Second Order | Truncate Gradient | Adaptive regularized FOBOS [31] |
| Ada-RDA | Second Order | Dual Averaging | Adaptive regularized RDA [31] |
| FSOL | First Order | Dual Averaging | The proposed Algorithm 3 |
| SSOL | Second Order | Dual Averaging | The proposed Algorithm 5 |
| CS-OGD | First Order | Non-Sparse | Cost-Sensitive Online Gradient Descent |
| CPA | First Order | Non-Sparse | Cost-Sensitive Passive-Aggressive |
| PAUM | First Order | Non-Sparse | Cost-Sensitive Perceptron Algorithm with Uneven Margin [36] |
| CS-FSOL | First Order | Dual Averaging | The proposed Algorithm 6 |
| CS-SSOL | Second Order | Dual Averaging | The proposed Algorithm 7 |

To examine the binary classification performance, besides the
synthetic dataset, we evaluate all of the above algorithms on a
number of benchmark datasets from web machine learning
repositories. Table 2 shows the details of all the datasets in our
experiments. These datasets are selected to allow us to evaluate the
algorithms on data with various characteristics: the number of
training examples ranges from thousands to millions, the feature
dimensionality ranges from hundreds to about 16 million, and the
total number of non-zero features on some datasets is more than one
billion. For the very large-scale WEBSPAM dataset, we run the
algorithms only once. The sparsity shown in the Sparsity column of
the table denotes the ratio of non-active feature dimensions, since
some feature dimensions are never active in the training process,
which is often the case for real-world high-dimensional datasets
such as WEBSPAM.
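The dataset sparsity described here (the share of feature dimensions that are never active) can be computed in a single pass over the data. The sketch below assumes each example is stored as a sparse dictionary mapping feature index to value; the function name `feature_sparsity` is illustrative and not taken from the paper.

```python
def feature_sparsity(examples, num_features):
    """Percentage of feature dimensions that are never non-zero.

    Each example is a dict {feature_index: value}.
    """
    active = set()
    for x in examples:
        # A stored zero does not make a dimension active.
        active.update(i for i, v in x.items() if v != 0)
    return 100.0 * (num_features - len(active)) / num_features


toy = [{0: 1.0, 3: 0.5}, {3: 2.0}, {1: 0.0}]
print(feature_sparsity(toy, num_features=5))  # dimensions 1, 2 and 4 are never active
```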

We conduct experiments by following standard online learning
settings for training a classifier, where an online learner
receives a single training example at each iteration and updates
the model sequentially. We examine how different sparsity levels
affect the test error rate of the classifier trained from a single
pass through the training data. We also measure the time cost of
the different algorithms to evaluate their computational
efficiency. To make a fair comparison, all the algorithms adopt the
same experimental settings. We use the hinge loss as the loss
function for the applicable algorithms. To identify the best set of
parameters for each algorithm on each dataset, we conduct a 5-fold
cross validation for grid searching the parameters with the
sparsity regularization parameter fixed at $\lambda = 0$. In
particular, the learning rates are searched from $2^{-1}$ to $2^{9}$
and the other parameters are searched from $2^{-5}$ to $2^{5}$.
With the best tuned parameters, each algorithm is evaluated 5
times, each time with a random permutation of the training set. All
the experiments were conducted on a Linux server (with Intel Xeon
CPU E5-2620 @2.00GHz, 4 CPU cores, 8GB memory), and the algorithms
are implemented in C++ and compiled with g++.
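The tuning protocol above can be sketched as follows. This is only an illustration under stated assumptions: the grids are taken to be powers of two with unit exponent steps (the text gives only the end points), a single "other" parameter `r` stands in for whatever each algorithm exposes, and `evaluate` is a hypothetical callback that trains on the training fold with $\lambda = 0$ and returns the validation error.

```python
import itertools
import random

learning_rates = [2.0 ** k for k in range(-1, 10)]  # 2^-1 ... 2^9
other_params = [2.0 ** k for k in range(-5, 6)]     # 2^-5 ... 2^5


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for a shuffled k-fold split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def grid_search(n, evaluate, k=5):
    """Return (mean_error, eta, r) with the lowest mean validation error."""
    best = None
    for eta, r in itertools.product(learning_rates, other_params):
        folds = k_fold_indices(n, k)
        err = sum(evaluate(eta, r, tr, va) for tr, va in folds) / k
        if best is None or err < best[0]:
            best = (err, eta, r)
    return best
```

The same fold split is reused for every parameter pair (the seed is fixed), so the candidates are compared on identical validation sets.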

5.2 Experiment on Synthetic Dataset 

To evaluate whether the proposed sparse online learning algorithm
is able to identify effective features for learning the models, we
design the first experiment on a synthetic dataset, which allows us
to control the exact numbers of effective and noisy feature
dimensions. In particular, we generate a synthetic dataset with
high dimensionality and high sparsity by following a scheme similar
to that in [28], [29]. The dataset contains a set of effective
feature dimensions that are correlated with the class labels and a
set of noisy feature dimensions that are uncorrelated with the
labels.


Specifically, we generate the synthetic dataset with 100,000
training examples and 10,000 test examples in $\mathbb{R}^{1000}$.
For each example, the first 100 dimensions are drawn from a
multivariate Gaussian distribution with diagonal covariance. Each
dimension of the mean vector is uniformly sampled from $-1$ to $1$,
and each diagonal entry of the covariance is uniformly sampled from
0.5 to 100. We set the split plane to be the same as the mean
vector. To introduce noisy feature dimensions, we randomly choose
200 noise dimensions out of the remaining 900 dimensions for each
example. The noise values are drawn from the Gaussian distribution
$\mathcal{N}(0, 100)$.
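The generation scheme can be sketched as below. Several details are assumptions made only for illustration: the label is taken as the side of the hyperplane through the origin whose normal is the mean vector, the 100 in $\mathcal{N}(0, 100)$ is read as a variance, and examples are stored as sparse dictionaries. The function name `make_synthetic` is hypothetical.

```python
import random


def make_synthetic(n, seed=0, dim=1000, n_effective=100, n_noise=200):
    """Return n (x, y) pairs; x is a dict {feature_index: value}, y is +1 or -1."""
    rng = random.Random(seed)
    mean = [rng.uniform(-1.0, 1.0) for _ in range(n_effective)]
    var = [rng.uniform(0.5, 100.0) for _ in range(n_effective)]
    data = []
    for _ in range(n):
        # Effective dimensions: Gaussian with diagonal covariance.
        x = {i: rng.gauss(mean[i], var[i] ** 0.5) for i in range(n_effective)}
        # Assumed labelling rule: sign of the projection onto the mean vector.
        score = sum(mean[i] * x[i] for i in range(n_effective))
        y = 1 if score >= 0 else -1
        # Noisy dimensions: 200 of the remaining 900, uncorrelated with y.
        for j in rng.sample(range(n_effective, dim), n_noise):
            x[j] = rng.gauss(0.0, 10.0)  # standard deviation 10, variance 100
        data.append((x, y))
    return data
```

Each example then has 300 stored entries out of 1,000 dimensions, of which only the first 100 carry information about the label.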



Fig. 1. Test error rate of sparse online classification on
synthetic dataset (x-axis: sparsity (%)).

We evaluate all the cost-insensitive sparse online classification
algorithms on the synthetic dataset. Figure 1 shows the test error
rates of all the compared algorithms, where the right diagram is a
sub-figure of the left one with sparsity from 80% to 100%. Several
observations can be drawn from the experimental results.

First of all, we observe that the test error rates of the truncate
gradient based algorithms (STG, FOBOS, Ada-FOBOS) increase
significantly when the sparsity level increases. By contrast, for
the dual averaging based algorithms (FSOL, Ada-RDA, SSOL), the test
error rates stay stable or even decrease as the sparsity level
increases, and rise dramatically only when the sparsity level is
higher than 90%, the actual sparsity level used for generating the
synthetic data. The result indicates that the dual averaging based
algorithms exploit the sparsity in the dataset more effectively. A
similar observation was reported in [22], which argued that dual
averaging based methods apply more aggressive truncation and thus
can generate significantly sparser solutions. Second, the proposed
second-order algorithm SSOL achieves the lowest error rate among
all the compared algorithms, especially at high sparsity levels.
This can be seen more clearly in the right diagram of Figure 1.
These encouraging results indicate that the proposed SSOL algorithm
can effectively exploit sparsity in sparse online classification
tasks.

TABLE 2
List of real-world datasets in our experiments.

| DataSet | Balance | #Train | #Test | #Feature Dimension | #Nonzero Features | Sparsity (%) | $T_+ : T_-$ |
|---|---|---|---|---|---|---|---|
| AUT | True | 40,000 | 22,581 | 20,707 | 1,969,407 | 3.07 | 1 : 0.33 |
| PCMAC | True | 1,000 | 946 | 7,510 | 55,470 | 3.99 | 1 : 1.00 |
| NEWS | True | 10,000 | 9,996 | 1,355,191 | 5,513,533 | 29.88 | 1 : 1.50 |
| RCV1 | True | 781,265 | 23,149 | 47,152 | 59,155,144 | 8.80 | 1 : 1.11 |
| URL | True | 2,000,000 | 396,130 | 3,231,961 | 231,249,028 | 7.44 | 1 : 2.02 |
| WEBSPAM | True | 300,000 | 50,000 | 16,071,971 | 1,118,027,721 | 95.82 | 1 : 0.64 |
| URL2 | False | 1,000,000 | 100,000 | 3,231,961 | 114,852,082 | 44.96 | 1 : 99 |
| WEBSPAM2 | False | 100,000 | 10,000 | 16,071,971 | 224,201,808 | 96.19 | 1 : 99 |

5.3 Test Error Rate on Large Real Datasets 

In this experiment, we compare the proposed algorithms (FSOL and
SSOL) with the other cost-insensitive algorithms on several
real-world datasets. Table 2 shows the details of the six datasets,
which can be roughly grouped into two major categories: the first
two datasets (AUT and PCMAC) are general small-scale binary
datasets, with the corresponding experimental results shown in
Figure 2 (a)-(b); the remaining four datasets (NEWS, RCV1, URL, and
WEBSPAM) are large-scale high-dimensional sparse datasets, with the
corresponding experimental results shown in Figure 2 (c)-(f). We
can draw several observations from these results.

First of all, we observe that most algorithms can learn an
effective sparse classification model with only marginal or even no
loss of accuracy. For example, in Figure 2 (d), the performance of
all the algorithms is almost stable when the sparsity level is
below 80%. This indicates that all the compared sparse online
classification algorithms can effectively exploit sparsity when the
sparsity level is low.

Second, in most cases, we observe that there exists a sparsity
threshold for each algorithm, below which the test error rate does
not change much; once the sparsity level exceeds the threshold, the
test error rate gets worse quickly.

Third, we observe that the dual averaging based second-order
algorithms (Ada-RDA and SSOL) consistently outperform the other
algorithms (STG, FOBOS, FSOL, and Ada-FOBOS), especially at high
sparsity levels. This indicates that the dual averaging technique
and second-order updating rules are effective in boosting the
classification performance.

Finally, when the sparsity is high, an essential requirement for
high-dimensional data stream classification tasks, the proposed
SSOL algorithm consistently outperforms the other algorithms over
all the evaluated datasets. For example, when the sparsity is about
99.8% for the WEBSPAM dataset (the total feature dimensionality is
16,609,143), the test error rate of SSOL is about 0.3%, while that
of Ada-RDA is 0.4% and that of Ada-FOBOS is 0.55%, as shown in
Figure 2 (f).
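To make the sparsity levels on the x-axis concrete, the following sketch assumes that model sparsity means the percentage of zero entries in the learned weight vector, and works out how many non-zero weights remain at the 99.8% level quoted above. Both function names are illustrative.

```python
def model_sparsity(w):
    """Percentage of zero weights in a weight vector."""
    return 100.0 * sum(1 for v in w if v == 0) / len(w)


def nonzero_budget(num_features, sparsity_percent):
    """Approximate number of non-zero weights left at a given sparsity level."""
    return round(num_features * (1.0 - sparsity_percent / 100.0))


print(model_sparsity([0.0, 0.5, 0.0, 0.0]))  # 75.0
print(nonzero_budget(16_609_143, 99.8))      # 33218
```

Under this reading, a 99.8% sparse model on WEBSPAM keeps roughly 33 thousand of more than 16 million weights.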

5.4 Running Time on Large Real Datasets 

We also examine the time costs of the different sparse online
classification algorithms; the experimental results are shown in
Figure 3. In this experiment, we only adopt the four
high-dimensional large-scale datasets. Several observations can be
drawn from the results.

First of all, we observe that when the sparsity level is low, the
time costs are generally stable; on the other hand, when the
sparsity level is high, the time cost of the second-order
algorithms sometimes increases somewhat, for example the time costs
of Ada-FOBOS, Ada-RDA and FSOL in Figure 3 (b) and (d). One
possible reason is that when the sparsity level is high, the model
might not be informative enough for prediction and thus may incur
significantly more updates. Since second-order algorithms are more
complicated than first-order algorithms, they are more sensitive to
an increasing number of updates.

Second, we can see that the proposed SSOL algorithm runs more
efficiently than the other second-order algorithms (Ada-RDA and
Ada-FOBOS). It is sometimes even faster than first-order algorithms
(e.g., FOBOS and STG). However, the first-order FSOL algorithm is
consistently faster than the second-order SSOL algorithm.

In summary, the above analysis shows that the proposed SSOL
algorithm achieves accuracy comparable to or even better than that
of existing second-order algorithms, while its time cost is
comparably small to that of state-of-the-art first-order algorithms
with truncated gradient methods.

Fig. 2. Test error rate on 6 large real datasets: (a) AUT, (b)
PCMAC, (c) NEWS, (d) RCV1, (e) URL, (f) WEBSPAM (x-axis: sparsity
(%)). (a)-(b) are two general datasets; (c)-(f) are four
large-scale high-dimensional sparse datasets. The second and fourth
rows are sub-figures of the first and third rows at high sparsity
levels, respectively.

Fig. 3. Time cost on four large-scale datasets: NEWS, RCV1, URL, and WEBSPAM.


5.5 Applications on Online Anomaly Detection 

Our last two experiments explore the application of the proposed
sparse online classification technique to online anomaly detection
tasks, namely malicious URL detection and web spam detection, where
the class distribution is imbalanced in real-world scenarios.

5.5.1 Malicious URL Detection 

In this experiment, we evaluate the cost-sensitive online learning
algorithms on the malicious URL detection task with the benchmark
dataset that can be downloaded from
http://sysnet.ucsd.edu/projects/url/. The original URL dataset was
purposely constructed to be roughly class-balanced, and it has
already been used in some previous studies.

In this experiment, we create a subset (denoted as "URL2") by
sampling from the original dataset to bring it closer to a
realistic distribution, where the number of normal URLs is
significantly larger than the number of malicious URLs. Following
the experimental setting in [35], we choose 10,000 positive
(malicious) instances and 990,000 negative (normal) instances.
Hence, the ratio $T_+ : T_- = 1 : 99$. For the test dataset, we
collect 100,000 samples from the original test set with the same
ratio. More details of the unbalanced URL dataset are shown in
Table 2.
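The construction of such an unbalanced subset can be sketched as follows. This is only an illustration: it assumes labels are +1 for malicious and -1 for normal, samples without replacement, and shuffles the result so that the online learner does not see all positives first. The function name `subsample_imbalanced` is hypothetical.

```python
import random


def subsample_imbalanced(labels, n_pos, n_neg, seed=0):
    """Return shuffled indices of n_pos positive and n_neg negative examples."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y > 0]
    neg = [i for i, y in enumerate(labels) if y <= 0]
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return chosen


# Toy usage with a 1 : 99 ratio on a small label list.
labels = [1] * 50 + [-1] * 5000
idx = subsample_imbalanced(labels, n_pos=10, n_neg=990)
```

For the URL2 training set, the corresponding call would use `n_pos=10_000` and `n_neg=990_000`.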

We compare the proposed CS-FSOL and CS-SSOL with three other
cost-sensitive algorithms (CS-OGD, CPA, and PAUM), as shown in
Table 1. In addition, we compare all the cost-insensitive
algorithms to evaluate the classification accuracy without adopting
the cost-sensitive loss function. The experimental results are
shown in Figure 4, where CS-OGD, CPA, and PAUM are non-sparse online
learning algorithms and thus are invariant to the sparsity.

Fig. 4. Balanced accuracy of different algorithms for
malicious URL detection.

Several observations can be drawn from the results. First of all,
all the cost-sensitive algorithms perform consistently better than
their cost-insensitive versions. This indicates that the proposed
cost-sensitive algorithms with cost-sensitive loss functions are
able to effectively resolve the class-imbalance problem. Second,
among all the cost-insensitive algorithms, the second-order online
learning algorithms are generally better than the first-order
algorithms. Third, among all the compared algorithms, the proposed
CS-SSOL algorithm achieves the best performance, which again
validates the efficacy of the proposed technique for real-world
data stream classification applications.
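The results in this subsection are reported as balanced accuracy rather than plain accuracy. The sketch below uses the common definition, the average of the true positive rate and the true negative rate; treating this as the exact metric plotted in Figure 4 is an assumption. It also shows why plain accuracy is uninformative at a 1 : 99 class ratio.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for labels in {+1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# A classifier that always predicts "normal" on 1 : 99 data.
y_true = [1] + [-1] * 99
y_pred = [-1] * 100
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(accuracy)                           # 0.99
print(balanced_accuracy(y_true, y_pred))  # 0.5
```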

5.5.2 Web Spam Detection 



Fig. 5. Balanced accuracy of different algorithms for web 
spam detection. 

In this experiment, we evaluate the proposed cost-sensitive online
learning algorithms on the web spam detection task. We construct an
unbalanced subset of the original web spam dataset used in Section
5.3. In particular, for the training dataset, we randomly choose
1,000 positive instances and 99,000 negative instances. Hence, the
ratio $T_+ : T_-$ of the training set is $1 : 99$. For the test
dataset, we collect 10,000 samples from the original test set with
the same positive-negative ratio.


We denote the imbalanced web spam dataset as "WEBSPAM2". More
details of the unbalanced web spam dataset are shown in Table 2. As
we can see, the feature dimension of the WEBSPAM2 dataset
(16,071,971) is much higher than that of URL2 (3,231,961), and the
feature representations of the WEBSPAM2 dataset are extremely
sparse (96.19% versus 44.96%). Hence, the anomaly detection task on
the WEBSPAM2 dataset is very challenging, with high-dimensional
sparse features and unbalanced data distributions. The experimental
settings in this section are the same as in Section 5.5.1, where
all cost-sensitive and cost-insensitive algorithms are compared.
The experimental results are shown in Figure 5.

Several observations can be drawn from the results. First of all,
for this sparse classification problem, the performance of the
non-sparse cost-sensitive algorithms decreases significantly. In
particular, the cost-insensitive algorithms SSOL and Ada-RDA
outperform the cost-sensitive algorithms CPA and PAUM. Second,
similar to the previous experiment, the second-order online
learning algorithms are generally better than the first-order
algorithms among both the cost-insensitive and the cost-sensitive
algorithms. Third, the proposed CS-SSOL algorithm consistently
achieves the best performance, which again validates the efficacy
of the proposed technique for real-world data stream classification
applications.

6 Conclusions and Future work 

In this paper, we introduced a framework of sparse online
classification (SOC) for large-scale high-dimensional data stream
classification tasks. We first showed that the framework
essentially includes an existing first-order sparse online
classification algorithm as a special case, and can be further
extended to derive new sparse online classification algorithms by
exploiting second-order information. We also extended the proposed
technique to solve cost-sensitive data stream classification
problems and explored its application to two online anomaly
detection tasks: malicious URL detection and web spam detection. We
analyzed the performance of the proposed algorithms with both
theoretical analysis and empirical studies, in which our
encouraging experimental results showed that the proposed
algorithms are able to achieve state-of-the-art performance in
comparison to a large family of diverse online learning algorithms.